E-Commerce Analytics: Insights for Growth & Retention¶

Introduction¶

In this project, I explored online retail transactions to uncover customer behavior, top-selling products, customer segments, high-value buyers, and cross-selling opportunities. The goal is to provide actionable insights for marketing strategies that enhance retention and boost revenue.

Dataset Information¶

The Online Retail dataset from the UCI Machine Learning Repository contains ~542k transactions from a UK-based non-store retailer between Dec 2010 and Dec 2011. The company mainly sells unique all-occasion gifts, and many customers are wholesalers.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Data Loading & Cleaning¶

First, I loaded the dataset, removed missing or invalid transactions, and created a new feature, TotalPrice.

In [2]:
df = pd.read_excel("Online Retail.xlsx")
In [3]:
# Drop missing CustomerID and invalid transactions
df = df.dropna(subset=["CustomerID"])
df = df[(df["Quantity"]>0) & (df["UnitPrice"]>0)]

# Convert InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

# Feature: TotalPrice
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# Get the latest date
snapshot_date = df["InvoiceDate"].max()

print("Cleaned dataset shape:", df.shape)
Cleaned dataset shape: (397884, 9)

The dataset contains ~398k transactions across 9 columns.
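As a sanity check, the filters above can be illustrated on a toy frame: rows with a missing CustomerID are dropped, as are cancellation invoices, which carry negative quantities. A minimal sketch with made-up rows:

```python
# Sketch of the cleaning step on a toy frame (rows are invented for illustration):
# the CustomerID filter drops anonymous rows, and the Quantity/UnitPrice filter
# drops cancellations (invoices prefixed with 'C' have negative Quantity).
import pandas as pd

toy = pd.DataFrame({
    "InvoiceNo":  ["536365", "C536379", "536380"],
    "CustomerID": [17850.0,  17850.0,   None],
    "Quantity":   [6,        -1,        4],
    "UnitPrice":  [2.55,     27.50,     1.25],
})

clean = toy.dropna(subset=["CustomerID"])
clean = clean[(clean["Quantity"] > 0) & (clean["UnitPrice"] > 0)]
print(len(clean))  # only the first row survives both filters
```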

In [4]:
df.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Top Products Analysis¶

To understand revenue drivers, I aggregated sales by product.

In [6]:
top_products = df.groupby("Description")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(x=top_products.values[::-1], y=top_products.index[::-1],
       orientation='h', title="Top 10 Products by Revenue").show()

Products such as "Paper Craft, Little Birdie" and "Regency Cakestand 3 Tier" are key revenue contributors. Marketing campaigns could highlight these best-sellers to boost sales further.

Country-Level Revenue Analysis¶

Next, I explored which markets drive revenue.

In [7]:
top_countries = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(x=top_countries.values[::-1], y=top_countries.index[::-1],
       orientation='h', title="Top 10 Countries by Revenue").show()

The UK clearly dominates sales. Marketing and bundling strategies should prioritize UK customers for maximum impact.

Monthly Revenue Trend¶

Looking at the last six months helps identify seasonality or stock issues.

In [8]:
# Last 6 months
six_months_ago = snapshot_date - pd.DateOffset(months=6)
df_recent = df[df["InvoiceDate"] >= six_months_ago].copy()
df_recent["InvoiceMonth"] = df_recent["InvoiceDate"].dt.to_period("M").astype(str)

# Aggregate monthly sales
monthly_sales = df_recent.groupby("InvoiceMonth")["TotalPrice"].sum().reset_index()
monthly_sales = monthly_sales.sort_values("InvoiceMonth")

# Visualize trend
px.line(monthly_sales, x="InvoiceMonth", y="TotalPrice",
        title="Monthly Sales Trend (Last 6 Months)", markers=True).show()

There’s a dip in November revenue, likely due to seasonality or stock issues. This insight can guide inventory planning and seasonal campaigns.

Customer Segmentation¶

To tailor marketing, I used Recency, Frequency, and Monetary (RFM) metrics to segment customers.

In [9]:
# RFM calculation
rfm = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,
    "InvoiceNo": "nunique",
    "TotalPrice": "sum"
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","TotalPrice":"Monetary"})

# Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[["Recency","Frequency","Monetary"]])
In [10]:
from sklearn.cluster import KMeans

# Elbow method
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(rfm_scaled).inertia_ for k in range(1, 8)]
plt.plot(range(1,8), inertia, 'o-')
plt.xlabel("k"); plt.ylabel("Inertia"); plt.title("Elbow Method")
plt.show()

The elbow analysis suggests k = 3 as a reasonable choice.
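The elbow is a visual heuristic, so it can be worth corroborating with silhouette scores. A minimal sketch, using synthetic blobs as a stand-in for the scaled RFM matrix:

```python
# Corroborating the elbow with silhouette scores; make_blobs data stands in
# for rfm_scaled here (three well-separated synthetic clusters).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, in [-1, 1]

best_k = max(scores, key=scores.get)
print(best_k)  # -> 3 for this synthetic data
```

On the real RFM matrix, agreement between the elbow and the silhouette peak adds confidence in the chosen k.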

K-Means Clustering¶

In [11]:
# Train K-Means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["Segment"] = kmeans.fit_predict(rfm_scaled)

# Segment Profile
rfm_profile = rfm.groupby("Segment").agg({"Recency":"mean","Frequency":"mean","Monetary":["mean","count"]}).round(2)
print(rfm_profile)
        Recency Frequency   Monetary      
           mean      mean       mean count
Segment                                   
0         39.98      4.85    2012.11  3231
1        245.02      1.58     631.14  1093
2          6.14     80.21  122888.41    14
In [12]:
# 3D visualization
fig_rfm3d = px.scatter_3d(rfm, x='Recency', y='Frequency', z='Monetary',
                          color=rfm['Segment'].astype(str),  # discrete legend per segment
                          size='Monetary', opacity=0.7,
                          title="3D RFM Segments")
fig_rfm3d.show()

Customer Segments:

  • VIPs (Segment 2): Loyal, high-spending → reward with perks.
  • Mid-tier (Segment 0): Regular buyers → upsell & cross-sell.
  • At-risk (Segment 1): Infrequent → reactivation campaigns.
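Since K-Means cluster IDs are arbitrary, these business names can be assigned programmatically by ranking segments on mean spend rather than hard-coding IDs. A sketch, using the profile numbers above as a toy frame:

```python
# Mapping numeric K-Means labels to business names by spend rank.
# `profile` is a toy stand-in for rfm_profile; the Monetary means are
# the values printed above.
import pandas as pd

profile = pd.DataFrame(
    {"Monetary_mean": [2012.11, 631.14, 122888.41]},
    index=pd.Index([0, 1, 2], name="Segment"),
)

# Lowest mean spend -> At-risk, middle -> Mid-tier, highest -> VIP
names = ["At-risk", "Mid-tier", "VIP"]
order = profile["Monetary_mean"].rank().astype(int) - 1
segment_names = {seg: names[r] for seg, r in order.items()}
print(segment_names)  # {0: 'Mid-tier', 1: 'At-risk', 2: 'VIP'}
```

This keeps the labels stable even if a different random seed permutes the cluster IDs.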

High-Value Customer Prediction¶

I defined high-value customers as the top 25% in total spending and trained a Random Forest model to predict them.

In [13]:
# Define target
q3 = df.groupby("CustomerID")["TotalPrice"].sum().quantile(0.75)
high_value_customers = df.groupby("CustomerID")["TotalPrice"].sum() > q3
In [14]:
# Features
customer_features = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,
    "InvoiceNo": "nunique",
    "UnitPrice": "mean",
    "Quantity": "mean",
    "Country": lambda x: x.mode()[0]
}).rename(columns={"InvoiceDate": "Recency", "InvoiceNo": "Frequency",
                   "UnitPrice": "AvgUnitPrice", "Quantity": "AvgQuantity"})

customer_features = pd.get_dummies(customer_features, columns=["Country"], drop_first=True)

X = customer_features
y = high_value_customers.astype(int)

# Scaling
X_scaled = scaler.fit_transform(X)

Train-Test Split¶

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
                                                    random_state=42, stratify=y)

Random Forest Classifier¶

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)

# Predict probabilities and classify high-value customers
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.4).astype(int)  # threshold 0.4 to capture more high-value customers
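The 0.4 threshold is one design choice; it could also be derived from the precision-recall trade-off, e.g. as the highest threshold that still meets a target recall. A hedged sketch on synthetic data (make_classification stands in for the real split):

```python
# Deriving a decision threshold from the precision-recall curve instead of
# fixing it a priori. Synthetic data stands in for the real features/labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.75], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

proba = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, proba)

# Keep recall at or above a target level, then take the strictest threshold
target_recall = 0.75
ok = recall[:-1] >= target_recall   # recall has one more entry than thresholds
best_threshold = thresholds[ok][-1]  # highest threshold still meeting the target
print(round(float(best_threshold), 2))
```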

Model Evaluation¶

In [17]:
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

print("ROC-AUC:", roc_auc_score(y_test, y_proba).round(3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
ROC-AUC: 0.943
Confusion Matrix:
 [[614  37]
 [ 55 162]]
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       651
           1       0.81      0.75      0.78       217

    accuracy                           0.89       868
   macro avg       0.87      0.84      0.85       868
weighted avg       0.89      0.89      0.89       868

With a ROC-AUC of 0.943, the model effectively identifies top spenders for focused marketing campaigns.
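A natural follow-up is to check which features drive the prediction via the fitted forest's feature_importances_. A sketch on synthetic data, with feature names mirroring the numeric columns used above (the real model also includes Country dummies):

```python
# Inspecting feature importances of a fitted random forest.
# make_classification stands in for the real customer feature matrix.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Recency", "Frequency", "AvgUnitPrice", "AvgQuantity"]
X, y = make_classification(n_samples=500, n_features=4,
                           n_informative=3, n_redundant=1, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = (pd.Series(clf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))  # impurity-based, sums to 1
print(importances)
```

On the real data this would indicate, for instance, whether recency or purchase frequency matters more when targeting likely high-value customers.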

Cross-Selling Opportunities¶

Using FP-Growth on UK transactions, I identified products frequently bought together.

In [18]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Basket matrix for UK invoices (keep only products with >10 total units sold)
basket_sets = (df[df['Country']=="United Kingdom"]
               .groupby(['InvoiceNo','Description'])['Quantity']
               .sum().unstack().loc[:, lambda x: x.sum() > 10] > 0)

# FP-Growth & association rules
frequent_itemsets = fpgrowth(basket_sets, min_support=0.01, use_colnames=True)
rules_top = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2) \
                .sort_values('lift', ascending=False).head(10)
rules_top
Out[18]:
antecedents consequents antecedent support consequent support support confidence lift representativity leverage conviction zhangs_metric jaccard certainty kulczynski
541 (HERB MARKER THYME) (HERB MARKER ROSEMARY) 0.010753 0.010873 0.010153 0.944134 86.829038 1.0 0.010036 17.705365 0.999228 0.884817 0.943520 0.938918
540 (HERB MARKER ROSEMARY) (HERB MARKER THYME) 0.010873 0.010753 0.010153 0.933702 86.829038 1.0 0.010036 14.921137 0.999350 0.884817 0.932981 0.938918
914 (REGENCY TEA PLATE GREEN ) (REGENCY TEA PLATE ROSES ) 0.013637 0.015980 0.011534 0.845815 52.930211 1.0 0.011316 6.382074 0.994671 0.637874 0.843311 0.783810
915 (REGENCY TEA PLATE ROSES ) (REGENCY TEA PLATE GREEN ) 0.015980 0.013637 0.011534 0.721805 52.930211 1.0 0.011316 3.545575 0.997040 0.637874 0.717958 0.783810
617 (POPPY'S PLAYHOUSE BEDROOM ) (POPPY'S PLAYHOUSE LIVINGROOM ) 0.015619 0.012556 0.010153 0.650000 51.769856 1.0 0.009956 2.821270 0.996244 0.563333 0.645550 0.729306
616 (POPPY'S PLAYHOUSE LIVINGROOM ) (POPPY'S PLAYHOUSE BEDROOM ) 0.012556 0.015619 0.010153 0.808612 51.769856 1.0 0.009956 5.143389 0.993153 0.563333 0.805576 0.729306
928 (SET OF 3 WOODEN STOCKING DECORATION) (SET OF 3 WOODEN TREE DECORATIONS) 0.014959 0.013757 0.010333 0.690763 50.211536 1.0 0.010127 3.189279 0.994968 0.562092 0.686450 0.720927
929 (SET OF 3 WOODEN TREE DECORATIONS) (SET OF 3 WOODEN STOCKING DECORATION) 0.013757 0.014959 0.010333 0.751092 50.211536 1.0 0.010127 3.957447 0.993755 0.562092 0.747312 0.720927
618 (POPPY'S PLAYHOUSE LIVINGROOM ) (POPPY'S PLAYHOUSE KITCHEN) 0.012556 0.017301 0.010693 0.851675 49.225611 1.0 0.010476 6.625290 0.992142 0.557994 0.849063 0.734865
619 (POPPY'S PLAYHOUSE KITCHEN) (POPPY'S PLAYHOUSE LIVINGROOM ) 0.017301 0.012556 0.010693 0.618056 49.225611 1.0 0.010476 2.585309 0.996934 0.557994 0.613199 0.734865

For example, Herb Marker Rosemary is often purchased with Herb Marker Thyme, suggesting bundling opportunities.
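To operationalize such rules, they can be turned into a simple "also bought" lookup. A sketch using a toy rules table in place of rules_top (the recommend helper is my own illustration, not part of mlxtend):

```python
# Turning association rules into an "also bought" lookup.
# A toy two-rule frame stands in for the rules_top table above.
import pandas as pd

rules = pd.DataFrame({
    "antecedents": [frozenset({"HERB MARKER THYME"}), frozenset({"HERB MARKER ROSEMARY"})],
    "consequents": [frozenset({"HERB MARKER ROSEMARY"}), frozenset({"HERB MARKER THYME"})],
    "lift": [86.8, 86.8],
})

def recommend(product, rules_df):
    """Return products from rules whose antecedents contain `product`."""
    hits = rules_df[rules_df["antecedents"].apply(lambda a: product in a)]
    return sorted({item for cons in hits["consequents"] for item in cons})

print(recommend("HERB MARKER THYME", rules))  # ['HERB MARKER ROSEMARY']
```

Applied to the full rules table, such a lookup could feed cart suggestions or bundle promotions directly.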

Conclusion¶

From ~398k transactions, we identified top-selling products and confirmed the UK as the main market. Seasonal sales dips suggest opportunities for smarter inventory planning. Customers were segmented into VIPs, mid-tier, and at-risk groups, enabling targeted retention, upselling, and re-engagement. High-value buyers were predicted with a Random Forest model, while FP-Growth revealed frequent product bundles. These insights empower marketing teams to run personalized campaigns, make data-driven decisions, and drive sustainable growth.